A Metadata Generation System for Scanned Scientific Volumes

Authors

  • Xiaonan Lu
  • Brewster Kahle
  • James Z. Wang
  • C. Lee Giles
Abstract

Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The increasing amount of digitized resources, including scanned books and scientific publications, requires the development of tools and methods that efficiently analyze and manage large collections of digitized resources. In this work, we tackle the problem of extracting metadata from scanned volumes of journals. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access functionality to digital library users. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We show the performance of our system on scanned bound historical documents nearly two centuries old. We have developed the system and integrated it into an operational digital library, the Internet Archive, for real-world usage.

Similar resources

Metadata Enrichment for Automatic Data Entry Based on Relational Data Models

Automatically generating data entry forms from relational data models is a well-known idea that has drawn growing attention as agile software development methods and the accompanying programming tools have become more popular. One of the requirements of the automation methods, whether in commercial products or the relevant research projects, ...

Paper to Screen: Processing Historical Scans in the ADS

The NASA Astrophysics Data System in conjunction with the Wolbach Library at the Harvard-Smithsonian Center for Astrophysics is working on a project to microfilm historical observatory publications. The microfilm is then scanned for inclusion in the ADS. The ADS currently contains over 700,000 scanned pages of volumes of historical literature. Many of these volumes lack clear pagination or othe...

Toward Enhanced Metadata Quality of Large-Scale Digital Libraries: Estimating Volume Time Range

Metadata is a special type of data that describes data. In the age of Big Data, the role of metadata has become more prominent; it is obvious that big data needs high-quality metadata description, as it becomes less and less possible for humans to go over all the data (if human readable) with the exponential growth of data sets. In this study we try to enhance metadata records (publication dates)...

Literature-driven Curation for Taxonomic Name Databases

Digitized biodiversity literature provides a wealth of content that lets machines make use of biodiversity knowledge. However, identifying taxonomic names and the associated semantic metadata is a difficult and labour-intensive process. We present a system to support human-assisted creation of semantic metadata. Information extraction techniques automatically identify taxonomic names from scanned docume...

Evolving Metadata Needs for an Institutional Repository: MIT's DSpace

As the DSpace digital repository system develops, various metadata needs have emerged to accommodate the differing uses being made of the system. In the initial stages of the project a qualified form of the Dublin Core metadata set was developed for use within the system. Subsequently it became evident that metadata would also have to be imported from existing MARC sources for batch loads of sc...

Publication date: 2008